Abstract

This paper proposes a new distance metric between clus- 
terings that incorporates information about the spatial dis- 
tribution of points and clusters. Our approach builds on 
the idea of a Hilbert space-based representation of clusters 
as a combination of the representations of their constituent 
points. We use this representation and the underlying metric 
to design a spatially-aware consensus clustering procedure.
This consensus procedure is implemented via a novel reduc- 
tion to Euclidean clustering, and is both simple and efficient. 
All of our results apply to both soft and hard clusterings. 
We accompany these algorithms with a detailed experimen- 
tal evaluation that demonstrates the efficiency and quality 
of our techniques. 

Keywords: Clustering, Ensembles, Consen- 
sus, Reproducing Kernel Hilbert Space. 

1 Introduction 

The problem of metaclustering has become important
in recent years as researchers have tried to combine the
strengths and weaknesses of different clustering algorithms
to find patterns in data. A popular metaclustering
problem is that of finding a consensus (or ensemble)
partition¹ from among a set of candidate partitions.
Ensemble-based clustering has been found to
be very powerful when different clusters are connected
in different ways, each detectable by different classes
of clustering algorithms [34]. For instance, no single
clustering algorithm can detect clusters of symmetric
Gaussian-like distributions of different density and
clusters of long thinly-connected paths; but these clusters
can be correctly identified by combining multiple
techniques (e.g. k-means and single-link) [13].
¹We use the term partition instead of clustering to represent
a set of clusters decomposing a dataset. This avoids confusion 
between the terms 'cluster', 'clustering' and the procedure used 
to compute a partition, and will help us avoid phrases like, 
"We compute consensus clusterings by clustering clusters in 
clusterings!" 



Other related and important metaclustering prob- 
lems include finding a different and yet informative par- 
tition to a given one, or finding a set of partitions that 
are mutually diverse (and therefore informative). In all 
these problems, the key underlying step is comparing 
two partitions and quantifying the difference between 
them. Numerous metrics (and similarity measures) have 
been proposed to compare partitions, and for the most 
part they are based on comparing the combinatorial 
structure of the partitions. This is done either by exam- 
ining pairs of points that are grouped together in one 
partition and separated in another [29l [H |25l [T^, or 
by information-theoretic considerations stemming from 
building a histogram of cluster sizes and normalizing it 
to form a distribution [24, 34].

These methods ignore the actual spatial description 
of the data, merely treating the data as atoms in a set 
and using set information to compare the partitions. 
As has been observed by many researchers [38], [3], [7],
ignoring the spatial relationships in the data can be
problematic. Consider the three partitions in Figure 1.
The first partition (FP) is obtained by a projection 
onto the y-axis, and the second (SP) is obtained via 
a projection onto the x-axis. Partitions (FP) and (SP) 
are both equidistant from partition (RP) under any of 
the above mentioned distances, and yet it is clear that 
Figure 1: Why spatially-aware distances are important [7]:
first and second partitions are equidistant from the reference
partition under a set-based distance. However, FP is
clearly more similar to RP than SP (see 2D2C in Table 1).



(FP) is more similar to the reference partition, based on 
the spatial distribution of the data. 

Some researchers have proposed spatially-aware distances
between partitions [38], [3], [7] (as we review below
in Section 1.2.3), but they all suffer from various
deficiencies. They compromise the spatial information
captured by the clusters ([38], [3]), they lack metric
properties ([38], [7]) (or have discontinuous ranges of
distances to obtain metric properties ([3])), or they are
expensive to compute, making them ineffective for large
data sets ([7]).

1.1 Our Work. In this paper we exploit a concise, 
linear reproducing kernel Hilbert space (RKHS) repre- 
sentation of clusters. We use this representation to con- 
struct an efficient spatially-aware metric between parti- 
tions and an efficient spatially-aware consensus cluster- 
ing algorithm. 

We build on some recent ideas about clusters: (i) 
that a cluster can be viewed as a sample of data 
points from a distribution [19], (ii) that through a 
similarity kernel a distribution can be losslessly lifted 
to a single vector in an RKHS [26], and (iii) that the
resulting distance between the representative vectors of 
two distributions in the RKHS can be used as a metric 
on the distributions [26], [32], [31].

Representations. We first adapt the representation
of the clusters in the RKHS in two ways: approximation
and normalization. Typically, vectors in an
RKHS are infinite dimensional, but they can be approximated
arbitrarily well in a finite-dimensional $\ell_2$ space
that retains the linear structure of the RKHS [28], [20].
This provides concise and easily manipulated representations
for entire clusters. Additionally, we normalize
these vectors to focus on the spatial information of the
clusters. This turns out to be important in consensus
clustering, as illustrated in Figure 4.

Distance Computation. Using this convenient
representation (an approximate normalized RKHS vector),
we develop a metric between partitions. Since the
clusters can now be viewed as points in (scaled) Euclidean
space, we can apply standard measures for comparing
point sets in such spaces. In particular, we define
a spatially-aware metric LiftEMD between partitions
as the transportation distance [15] between the
representatives, weighted by the number of points they
represent. While the transportation distance is a standard
distance metric on probability distributions, it is
expensive to compute (requiring $O(n^3)$ time for n points) [21].
However, since the points here are clusters, and the
number of clusters (k) is typically significantly less than
the data size (n), this is not a significant bottleneck, as
we see in Section 6.



Consensus. We exploit the linearity of the RKHS
representations of the clusters to design an efficient
consensus clustering algorithm. Given several partitions,
each represented as a set of vectors in an RKHS, we can
find a partition of this data using standard Euclidean
clustering algorithms. In particular, we can compute a
consensus partition by simply running k-means (or
hierarchical agglomerative clustering) on the lifted
representations of each cluster from all input partitions. This
reduction from consensus to Euclidean clustering is a
key contribution: it allows us to utilize the extensive
research and fast algorithms for Euclidean clustering,
rather than designing complex hypergraph partitioning
methods [34].

Evaluation. All of these aspects of our technical 
contributions are carefully evaluated on real-world and 
synthetic data. As a result of the convenient isometric 
representation, the well-founded metric, and reduction 
to many existing techniques, our methods perform well 
compared to previous approaches and are much more 
efficient. 

1.2 Background 

1.2.1 Clusters as Distributions. The core idea in 
doing spatially aware comparison of partitions is to treat 
a cluster as a distribution over the data, for example as 
a sum of $\delta$-functions at each point of the cluster [7] or as
a spatial density over the data [3]. The distance between
two clusters can then be defined as a distance between 
two distributions over a metric space (the underlying 
spatial domain). 

1.2.2 Metrizing Distributions. There are stan- 
dard constructions for defining such a distance; the 
most well known metrics are the transportation distance
[15] (also known as the Wasserstein distance, the
Kantorovich distance, the Mallows distance or the Earth
mover's distance), and the Prokhorov metric [27]. Another
interesting approach was initiated by Müller [26],
and develops a metric between general measures based 
on integrating test functions over the measure. When 
the test functions are chosen from a reproducing kernel
Hilbert space [2] (RKHS), the resulting metric on
distributions has many nice properties [32l [iTl [311 15] , 
most importantly that it can be isometrically embed- 
ded into the Hilbert space, yielding a convenient (but 
infinite dimensional) representation of a measure. 

This measure has been applied to the problem of 
computing a single clustering by Jegelka et al. [19]. In
their work, each cluster is treated as a distribution and 
the partition is found by maximizing the inter-cluster 
distance of the cluster representatives in the RKHS. We 
modify this distance and its construction in our work. 



A parallel line of development generalized this idea 
independently to measures over higher dimensional ob- 
jects (lines, surfaces and so on). The resulting metric 
(the current distance) is exactly the above metric when 
applied to 0-dimensional objects (scalar measures) and 
has been used extensively [36l [161 ISl HO] to compare 
shapes. In fact, thinking of a cluster of points as a 
"shape" was a motivating factor in this work. 

1.2.3 Distances Between Partitions. Section 1
reviews spatially-aware and space-insensitive approaches
to comparing partitions. We now describe
prior work on spatially-aware distances between
partitions in more detail.




Figure 2: Dataset with 3 concentric circles, each
representing a cluster partitioning the data. The CC
distance [38] cannot distinguish between these clusters.

Zhou et al. [38] define a distance metric CC by
replacing each cluster by its centroid (this of course
assumes the data does not lie in an abstract metric space),
and computing a weighted transportation distance
between the sets of cluster centroids. Technically, their
method yields a pseudo-metric, since two different
clusters can have the same centroid, for example in the case
of concentric ring clusters (Figure 2). It is also oblivious
to the distribution of points within a cluster.

Coen et al. [7] avoid the problem of selecting a
cluster center by defining the distance between clusters
as the transportation distance between the full sets
of points comprising each cluster. This yields a metric
on the set of all clusters in both partitions. In a
second stage, they define the similarity distance
CDistance between two partitions as the ratio between the
transportation distance between the two partitions (using
the metric just constructed as the base metric) and
a "non-informative" transportation distance in which
each cluster center distributes its mass equally to all
cluster centers in the other partition. While this measure
is symmetric, it does not satisfy the triangle inequality
and is therefore not a metric.

Bae et al. [3] take a slightly different approach. They
build a spatial histogram over the points in each cluster,
and use the counts as a vector signature for the cluster.
Cluster similarity is then computed via a dot product,
and the similarity between two partitions is then defined
as the sum of cluster similarities in an optimal matching
between the clusters of the two partitions, normalized
by the self-similarity of the two partitions.
Figure 3: Data set with 4 clusters, each a set grouped in a
single grid cell. The two clusters with blue open circles are
as close as the two clusters with filled red triangles under
the $D_{ADCO}$ distance [3].

In general, such a spatial partitioning would require 
a number of histogram bins exponential in the dimen- 
sion; they get around this problem by only retaining in- 
formation about the marginal distributions along each 
dimension. One weakness of this approach is that only 
points that fall into the same bin contribute to the overall
similarity. This can lead dissimilar clusters to be
viewed as similar; in Figure 3 the two triangle (red)
clusters will be considered as similar as the two circle (blue)
clusters.

Their approach yields a similarity, and not a distance
metric. In order to construct a metric, they have
to do the usual transformation dist = 1 - sim and then
add one to each distance between non-identical items,
which yields the somewhat unnatural (and discontinuous)
metric $D_{ADCO}$. Their method also implicitly
assumes (like Zhou et al. [38]) that the data lies in
Euclidean space.

Our approach. Our method, centered around the 
RKHS-based metric between distributions, addresses all 
of the above problems. It yields a true metric, incorpo- 
rates the actual distribution of the data correctly, and 
avoids exponential dependency on the dimension. The 
price we pay is the requirement that the data lie in a 
space admitting a positive definite kernel. However, this 
actually enables us to apply our method to clustering 
objects like graphs and strings, for which similarity
kernels exist [12], [23] but no convenient vector space
representation is known.

1.2.4 Consensus Clustering Algorithms. One of
the most popular methods for computing consensus
between a collection of partitions is the majority rule: for
each pair of points, each partition "votes" on whether
the pair of points is in the same cluster or not, and the
majority vote wins. While this method is simple, it is
expensive and is spatially-oblivious; two points might
lie in separate clusters that are close to each other.

Alternatively, consensus can be defined via a 1- 
median formulation: given a distance between parti- 
tions, the consensus partition is the partition that min- 
imizes the sum of distances to all partitions. If the 
distance function is a metric, then the best partition 
from among the input partitions is guaranteed to be 
within twice the cost of the optimal solution (via trian- 
gle inequality). In general, it is challenging to find an 
arbitrary partition that minimizes this function. For ex- 
ample, the above majority-based method can be viewed 
as a heuristic for computing the 1-median under the 
Rand distance, and algorithms with formal approximations
exist for this problem [14].
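
The factor-two guarantee suggests a simple baseline, sketched here in Python: evaluate every input partition as a candidate consensus and keep the one with the smallest total distance. The function name `best_input_partition` and the `dist` callback are illustrative assumptions rather than anything defined in this paper; any metric on partitions can be plugged in.

```python
def best_input_partition(partitions, dist):
    """Return (index, cost) of the input partition that minimizes the
    sum of distances to all input partitions, i.e. a 1-median that is
    restricted to the inputs. `dist` is assumed to be a metric."""
    best_idx, best_cost = None, float("inf")
    for i, candidate in enumerate(partitions):
        cost = sum(dist(candidate, other) for other in partitions)
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, best_cost


# Toy usage: numbers stand in for partitions, |a - b| is the metric.
idx, cost = best_input_partition([0.0, 1.0, 5.0], lambda a, b: abs(a - b))
```

This costs one distance evaluation per ordered pair of inputs, so it is only practical when the partition distance itself is cheap.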

Recently Ansari et al. [1] extended the above
schemes to be spatially-aware by inserting CDistance
in place of Rand distance above. This method is
successful in grouping similar clusters from an ensemble
of partitions, but it is quite slow on large data sets P
since it requires computing $d_T$ (defined in Section 2.1)
on the full dataset. Alternatively, using representations
of each cluster in the ambient space (such as its mean,
as in CC [38]) would produce another spatially-aware
ensemble clustering variant, but would be less effective
because its representation causes unwanted simplification
of the clusters; see Figure 2.

2 Preliminaries 

2.1 Definitions. Let P be the set of points being
clustered, with $|P| = n$. We use the term cluster to
refer to a subset C of P (i.e. an actual cluster of the
data), and the term partition to refer to a partitioning
of P into clusters (i.e. what one would usually refer
to as a clustering of P). Clusters will always be
denoted by the capital letters A, B, C, ..., and partitions
will be denoted by the symbols $\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$. We will
also consider soft partitions of P, which are fractional
assignments $\{p(C|x)\}$ of points x to clusters C such that
for any x, the assignment weights $p(C|x)$ sum to one.

We will assume that P is drawn from a space X
endowed with a reproducing kernel $\kappa : X \times X \to \mathbb{R}$.
The kernel $\kappa$ induces a Hilbert space $H_\kappa$ via
the lifting map $\Phi : X \to H_\kappa$, with the property that
$\kappa(x, y) = \langle \Phi(x), \Phi(y) \rangle_\kappa$, $\langle \cdot, \cdot \rangle_\kappa$ being the inner product
that defines $H_\kappa$.

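Let p, q be probability distributions defined over
X. Let $\mathcal{F}$ be a set of real-valued bounded measurable
functions defined over X. Let $\mathcal{F}_\kappa = \{f \in \mathcal{F} \mid \|f\|_\kappa \le 1\}$
denote the unit ball in the Hilbert space $H_\kappa$. The
integral probability metric [26] $\gamma_\kappa$ on distributions p, q
is defined as $\gamma_\kappa(p, q) = \sup_{f \in \mathcal{F}_\kappa} \left| \int_X f\,dp - \int_X f\,dq \right|$. We
will make extensive use of the following explicit formula
for $\gamma_\kappa(p, q)$:

(2.1)  $\gamma_\kappa^2(p, q) = \iint_X \kappa(x, y)\,dp(x)\,dp(y) + \iint_X \kappa(x, y)\,dq(x)\,dq(y) - 2 \iint_X \kappa(x, y)\,dp(x)\,dq(y),$

which can be derived (via the kernel trick) from the
following formula for $\gamma_\kappa$ [32]: $\gamma_\kappa(p, q) = \left\| \int_X \kappa(\cdot, x)\,dp(x) - \int_X \kappa(\cdot, x)\,dq(x) \right\|_{H_\kappa}$.
This formula also gives us the lifting
map $\Phi$, since we can write $\Phi(p) = \int_X \kappa(\cdot, x)\,dp(x)$.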
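
For completeness, the step from the norm formula to (2.1) can be written out. This sketch assumes only the reproducing property $\langle \kappa(\cdot,x), \kappa(\cdot,y) \rangle_{H_\kappa} = \kappa(x,y)$ and that the integrals may be exchanged with the inner product.

```latex
\begin{align*}
\gamma_\kappa^2(p,q)
 &= \Big\| \int_X \kappa(\cdot,x)\,dp(x) - \int_X \kappa(\cdot,x)\,dq(x) \Big\|_{H_\kappa}^2 \\
 &= \langle \Phi(p),\Phi(p)\rangle_{H_\kappa}
    + \langle \Phi(q),\Phi(q)\rangle_{H_\kappa}
    - 2\,\langle \Phi(p),\Phi(q)\rangle_{H_\kappa} \\
 &= \iint_X \kappa(x,y)\,dp(x)\,dp(y)
    + \iint_X \kappa(x,y)\,dq(x)\,dq(y)
    - 2\iint_X \kappa(x,y)\,dp(x)\,dq(y).
\end{align*}
```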

The transportation metric. Let $D : X \times X \to \mathbb{R}$
be a metric over X. The transportation
distance between p and q is then defined as

$d_T(p, q) = \inf_{f : X \times X \to [0,1]} \int_X \int_X f(x, y)\, D(x, y)\,dx\,dy,$

such that $\int_X f(x, y)\,dx = q(y)$ and $\int_X f(x, y)\,dy = p(x)$.
Intuitively, $f(x, y) D(x, y)$ measures the work in
transporting $f(x, y)$ mass from $p(x)$ to $q(y)$.

2.2 An RKHS Distance Between Clusters. We
use $\gamma_\kappa$ to construct a metric on clusters. Let $C \subset P$ be
a cluster. We associate with C the distribution
$p(C) = \sum_x p(C|x)\, w(x)\, \delta_x(\cdot)$, where $\delta_x(\cdot)$ is the Kronecker
$\delta$-function and $w : P \to [0, 1]$ is a weight function.
Given two clusters $C, C' \subset P$, we define
$d(C, C') = \gamma_\kappa(p(C), p(C'))$.

An example. A simple example illustrates how
this distance generalizes pure partition-based distances
between clusters. Suppose we fix the kernel $\kappa(x, y)$ to be
the discrete kernel: $\kappa(x, x) = 1$, $\kappa(x, y) = 0\ \forall x \ne y$.
Then it is easy to verify that $d(C, C') = \sqrt{|C \Delta C'|}$
is the square root of the cardinality of the symmetric
difference $C \Delta C'$, which is a well known set-theoretic
measure of dissimilarity between clusters. Since this
kernel treats all distinct points as equidistant from
each other, the only information remaining is the
set-theoretic difference between the clusters. As $\kappa$ acquires
more spatial information, $d(C, C')$ incorporates this into
the distance calculation.
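
The example can be checked numerically. The following Python sketch evaluates $d(C, C')$ directly from the explicit formula (2.1) for discrete clusters; the function names, the unit weights $w(x) = 1$, and the Gaussian bandwidth are assumptions made for illustration.

```python
import math


def kernel_distance(C1, C2, kernel, weight=lambda x: 1.0):
    """d(C1, C2) from formula (2.1) for hard, discrete clusters.

    Evaluates all pairwise kernel values, so the cost is quadratic
    in the cluster sizes.
    """
    def cross(A, B):
        return sum(weight(a) * weight(b) * kernel(a, b) for a in A for b in B)

    sq = cross(C1, C1) + cross(C2, C2) - 2.0 * cross(C1, C2)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative round-off


def discrete_kernel(x, y):
    return 1.0 if x == y else 0.0


def gaussian_kernel(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


C1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
C2 = [(1.0, 0.0), (2.0, 0.0), (9.0, 9.0)]
# The symmetric difference of C1 and C2 contains two points.
d_discrete = kernel_distance(C1, C2, discrete_kernel)
d_gauss = kernel_distance(C1, C2, gaussian_kernel)
```

With the discrete kernel only the two unmatched points contribute, whereas the Gaussian kernel also accounts for how far apart they lie.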

2.2.1 Representations in $H_\kappa$. There is an elegant
way to represent points, clusters and partitions in the
RKHS $H_\kappa$. Define the lifting map $\Phi(x) = \kappa(\cdot, x)$. This
takes a point $x \in P$ to a vector $\Phi(x)$ in $H_\kappa$. A cluster
$C \subset P$ can now be expressed as a weighted sum of
these vectors: $\Phi(C) = \sum_{x \in C} w(x) \Phi(x)$. Note that
for clarity, in what follows we will assume without loss
of generality that all partitions are hard; to construct
the corresponding soft partition-based expression, we
merely replace terms of the form $\{x \in C\} = 1_{x \in C}$ by
the probability $p(C|x)$.

$\Phi(C)$ is also a vector in $H_\kappa$, and we can now rewrite
$d(C, C')$ as $d(C, C') = \|\Phi(C) - \Phi(C')\|_{H_\kappa}$.

Finally, a partition $\mathcal{P} = \{C_1, C_2, \ldots, C_k\}$ of P can be
represented by the set of vectors $\Phi(\mathcal{P}) = \{\Phi(C_i)\}$ in $H_\kappa$.
We note that as long as the kernel is chosen correctly
[32], this mapping is isometric, which implies that the
representation $\Phi(C)$ is a lossless representation of C.

The linearity of representation is a crucial feature
of how clusters are represented. While in the original
space, a cluster might describe an unusually shaped
collection of points, the same cluster in $H_\kappa$ is merely
the weighted sum of the corresponding vectors $\Phi(x)$.
As a consequence, it is easy to represent soft partitions
as well. A cluster C can be represented by the vector
$\Phi(C) = \sum_x w(x)\, p(C|x)\, \Phi(x)$.

2.2.2 An RKHS-based Clustering. Jegelka et
al. [19] used the RKHS-based representation of clusters
to formulate a new cost function for computing a single
clustering. In particular, they considered the optimization
problem of finding the partition $\mathcal{P} = \{C_1, C_2\}$ of
two clusters to maximize

$C(\mathcal{P}) = |C_1| \cdot |C_2| \cdot \|\Phi(C_1) - \Phi(C_2)\|^2_{H_\kappa} + \lambda_1 \|\Phi(C_1)\|^2_{H_\kappa} + \lambda_2 \|\Phi(C_2)\|^2_{H_\kappa},$

for various choices of kernel and regularization terms
$\lambda_1$ and $\lambda_2$. They mention that this could then be
generalized to find an arbitrary k-partition by introducing
more regularizing terms. Their paper focuses primarily
on the generality of this approach and how it connects
to other clustering frameworks, and they do not discuss
algorithmic issues in any great detail.

3 Approximate Normalized Cluster Representation

We adapt the RKHS-based representation of clusters
$\Phi(C)$ in two ways to make it more amenable to our
meta-clustering goals. First, we approximate $\Phi(C)$ to a
finite dimensional ($\rho$-dimensional) vector. This provides
a finite representation of each cluster in $\mathbb{R}^\rho$ (as opposed
to a vector in the infinite dimensional $H_\kappa$), it retains
linearity properties, and it allows for fast computation
of distance between two clusters. Second, we normalize
$\Phi(C)$ to remove any information about the size of
the cluster, retaining only spatial information. This
property becomes critical for consensus clustering.

3.1 Approximate Lifting Map. The lifted
representation $\Phi(x)$ is the key to the representation of
clusters and partitions, and its computation plays a critical
role in the overall complexity of the distance computation.
For kernels of interest (like the Gaussian kernel),
$\Phi(x)$ cannot be computed explicitly, since the induced
RKHS is an infinite-dimensional function space.

However, we can take advantage of the shift-invariance
of commonly occurring kernels². For these
kernels a random projection technique in Euclidean
space defines an approximate lifting map $\tilde\Phi : X \to \ell_2^\rho$
with the property that for any $x, y \in P$,

$\left| \|\tilde\Phi(x) - \tilde\Phi(y)\|_2 - \|\Phi(x) - \Phi(y)\|_{H_\kappa} \right| \le \varepsilon,$

where $\varepsilon > 0$ is an arbitrary user defined parameter, and
$\rho = \rho(\varepsilon)$. Notice that the approximate lifting map takes
points to $\ell_2^\rho$ with the standard inner product, rather
than a general Hilbert space.

The specific construction is due to Rahimi and
Recht [28] and analyzed by Joshi et al. [20], to yield
the following result:

Theorem 3.1. ([20]) Given a set of n points $P \subset X$,
shift-invariant kernel $\kappa : X \times X \to \mathbb{R}$ and any
$\varepsilon > 0$, there exists a map $\tilde\Phi : X \to \mathbb{R}^\rho$,
$\rho = O((1/\varepsilon^2) \log n)$, such that for any $x, y \in P$,

$\left| \|\tilde\Phi(x) - \tilde\Phi(y)\|_2 - \|\Phi(x) - \Phi(y)\|_{H_\kappa} \right| \le \varepsilon.$

The actual construction is randomized and yields
a $\tilde\Phi$ as above with probability $1 - \delta$, where
$\rho = O((1/\varepsilon^2) \log(n/\delta))$. For any x, constructing the vector
$\tilde\Phi(x)$ takes $O(\rho)$ time.
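
To make the random projection idea concrete, here is a minimal Python sketch of random Fourier features in the style of Rahimi and Recht for the Gaussian kernel. It is an illustration under stated assumptions, not the exact construction analyzed in [20]: the function name `make_rff_map`, the bandwidth `sigma`, the feature count, and the seed are all hypothetical choices, and only the standard library is used.

```python
import math
import random


def make_rff_map(dim, num_features, sigma=1.0, seed=0):
    """Random Fourier feature map for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Returns a function x -> list of `num_features` floats whose
    inner products approximate k(x, y) in expectation.
    """
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density N(0, I / sigma^2).
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def lift(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return lift


lift = make_rff_map(dim=2, num_features=4000, sigma=1.0, seed=0)
x, y = (0.0, 0.0), (1.0, 0.5)
approx = sum(a * b for a, b in zip(lift(x), lift(y)))
exact = math.exp(-(1.0 ** 2 + 0.5 ** 2) / 2.0)
```

In this sketch each lifted vector costs time proportional to the input dimension times the number of features, and the approximation error shrinks roughly with the inverse square root of the number of features.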

3.2 Normalizing $\Phi(C)$. The lifting map $\Phi$ is linear
with respect to the weights of the data points,
while being nonlinear in their location. Since
$\Phi(C) = \sum_{x \in C} w(x) \Phi(x)$, this means that any scaling of the
vectors $\Phi(x)$ translates directly into a uniform scaling of
the weights of the data, and does not affect the spatial
positioning of the points. This implies that we are free
to normalize the cluster vectors $\Phi(C)$, so as to remove
the scale information, retaining only, and exactly, the
spatial information. In practice, we will normalize the
cluster vectors to have unit length; let

$\bar\Phi(C) = \Phi(C) / \|\Phi(C)\|_{H_\kappa}.$

Figure 4 shows an example of why it is important
to compare RKHS representations of vectors using
only their spatial information. In particular, without
normalizing, small clusters C will have small norms
$\|\Phi(C)\|_{H_\kappa}$, and the distance between two small vectors
$\|\Phi(C_1) - \Phi(C_2)\|_{H_\kappa}$ is at most $\|\Phi(C_1)\|_{H_\kappa} + \|\Phi(C_2)\|_{H_\kappa}$.
Thus all small clusters will likely have similar
unnormalized RKHS vectors, irrespective of spatial location.



²A kernel $\kappa(x, y)$ defined on a vector space is shift-invariant if
it can be written as $\kappa(x, y) = g(x - y)$.
Figure 4: Two partitions $\mathcal{P}_1 = \{A_1, B_1\}$ and
$\mathcal{P}_2 = \{A_2, B_2, C_2\}$ of the same dataset, and a 2-d visualization
of all of their representations in an RKHS. Note that the
unnormalized vectors (on the left) have $\Phi(B_1)$ far from
$\Phi(B_2)$ and $\Phi(C_2)$ even though the second two are subsets
of the first. The normalized vectors (on the right) have
$\bar\Phi(B_1)$ close to both $\bar\Phi(B_2)$ and $\bar\Phi(C_2)$. In particular,
$\Phi(C_2)$ is closer to $\Phi(A_1)$ than $\Phi(B_1)$, but $\bar\Phi(C_2)$ is much
closer to $\bar\Phi(B_1)$ than $\bar\Phi(A_1)$.

3.3 Computing the Distance Between Clusters.

For two clusters $C, C'$, we defined the distance between
them as $d(C, C') = \gamma_\kappa(p(C), p(C'))$. Since the two
distributions $p(C)$ and $p(C')$ are discrete (defined over
$|C|$ and $|C'|$ elements respectively), we can use (2.1)
to compute $d(C, C')$ in time $O(|C| \cdot |C'|)$. While this
may be suitable for small clusters, it rapidly becomes
expensive as the cluster sizes increase.

If we are willing to approximate $d(C, C')$, we can
use Theorem 3.1 combined with the implicit definition
of $d(C, C')$ as $\|\Phi(C) - \Phi(C')\|_{H_\kappa}$. Each cluster C is
represented as a sum of $\rho$-dimensional vectors, and
then the $\ell_2$ distance between the resulting vectors can be
computed in $O(\rho)$ time. The following approximation
guarantee on $d(C, C')$ then follows from the triangle
inequality and an appropriate choice of $\varepsilon$.

Theorem 3.2. For any two clusters $C, C'$ and any
$\varepsilon > 0$, $d(C, C')$ can be approximated to within an
additive error $\varepsilon$ in time $O((|C| + |C'|)\rho)$, where
$\rho = O((1/\varepsilon^2) \log n)$.
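
The two-stage computation (sum the lifted points once, then compare fixed-length vectors) can be sketched as follows in Python. The helper names and the toy two-dimensional "lifted" vectors are assumptions for illustration; in practice the point vectors would come from an approximate lifting map. The example also shows the effect of normalization described in Section 3.2.

```python
import math


def lift_cluster(point_vectors, weights=None):
    """Weighted sum of lifted point vectors: the cluster representative.

    For a soft cluster, pass weights p(C|x) * w(x); the default is 1.
    """
    if weights is None:
        weights = [1.0] * len(point_vectors)
    total = [0.0] * len(point_vectors[0])
    for vec, w in zip(point_vectors, weights):
        for i, value in enumerate(vec):
            total[i] += w * value
    return total


def normalize(vec):
    """Scale a cluster vector to unit l2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def l2_distance(u, v):
    """Distance between two cluster vectors, linear in their length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy lifted points: a 1-point and a 10-point cluster at the same location.
small = lift_cluster([[1.0, 0.0]])
large = lift_cluster([[1.0, 0.0]] * 10)
raw_gap = l2_distance(small, large)                         # driven by size only
unit_gap = l2_distance(normalize(small), normalize(large))  # same location
```

Before normalization the two clusters look far apart purely because of their sizes; after normalization they coincide.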



4 New Distances between Partitions

Let $\mathcal{P} = \{C_1, C_2, \ldots\}$ and $\mathcal{P}' = \{C'_1, C'_2, \ldots\}$
be two different partitions of P with associated
representations $\Phi(\mathcal{P}) = \{\Phi(C_1), \Phi(C_2), \ldots\}$ and
$\Phi(\mathcal{P}') = \{\Phi(C'_1), \Phi(C'_2), \ldots\}$. Similarly,
$\tilde\Phi(\mathcal{P}) = \{\tilde\Phi(C_1), \tilde\Phi(C_2), \ldots\}$ and
$\tilde\Phi(\mathcal{P}') = \{\tilde\Phi(C'_1), \tilde\Phi(C'_2), \ldots\}$.
Since the two representations are sets of points in a
Hilbert space, we can draw on a number of techniques
for comparing point sets from the world of shape matching
and pattern analysis.

We can apply the transportation distance $d_T$ on
these vectors to compare the partitions, treating the
partitions as distributions. In particular, the partition
$\mathcal{P}$ is represented by the distribution

(4.1)  $\sum_{\Phi(C) \in \Phi(\mathcal{P})} \frac{|C|}{|P|} \cdot \delta_{\Phi(C)}$

where $\delta_{\Phi(C)}$ is a Dirac delta function at $\Phi(C) \in H_\kappa$,
with $H_\kappa$ as the underlying metric. We will refer to this
metric on partitions as

• LiftEMD$(\mathcal{P}, \mathcal{P}') = d_T(\Phi(\mathcal{P}), \Phi(\mathcal{P}'))$.
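
Because each partition contributes only finitely many weighted points, the transportation distance here reduces to a finite linear program. The sketch below writes it out for partitions with $k$ and $k'$ clusters, assuming the cluster weights $|C|/n$ with $n = |P|$; the variable name $f_{ij}$ is introduced for illustration.

```latex
\begin{align*}
\textsc{LiftEMD}(\mathcal{P},\mathcal{P}') \;=\; \min_{f \ge 0}\;
  & \sum_{i=1}^{k}\sum_{j=1}^{k'} f_{ij}\,
    \big\|\Phi(C_i) - \Phi(C'_j)\big\|_{H_\kappa} \\
\text{subject to}\quad
  & \sum_{j=1}^{k'} f_{ij} = \frac{|C_i|}{n} \quad (1 \le i \le k), \qquad
    \sum_{i=1}^{k} f_{ij} = \frac{|C'_j|}{n} \quad (1 \le j \le k').
\end{align*}
```

The program has $k \cdot k'$ variables and $k + k'$ equality constraints, independent of the number of data points.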

An example, continued. Once again, we can
simulate the loss of spatial information by using the
discrete kernel as in Section 2.2. The transportation
metric is computed (see Section 2.1) by minimizing
a functional over all partial assignments $f(x, y)$. If
we set $f(C, C') = |C \cap C'|/n$ to be the fraction of
points overlapping between clusters, then the resulting
transportation cost is precisely the Rand distance [29]
between the two partitions! This observation has two
implications. First, that standard distance measures
between partitions appear as special cases of this general
framework. Second, LiftEMD$(\mathcal{P}, \mathcal{P}')$ will always be at
most the Rand distance between $\mathcal{P}$ and $\mathcal{P}'$.
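We can also use other measures. Let

$\vec{d}_H(\Phi(\mathcal{P}), \Phi(\mathcal{P}')) = \max_{v \in \Phi(\mathcal{P})} \min_{w \in \Phi(\mathcal{P}')} \|v - w\|_{H_\kappa}.$

Then the Hausdorff distance [6] is defined as

$d_H(\Phi(\mathcal{P}), \Phi(\mathcal{P}')) = \max\left( \vec{d}_H(\Phi(\mathcal{P}), \Phi(\mathcal{P}')),\ \vec{d}_H(\Phi(\mathcal{P}'), \Phi(\mathcal{P})) \right).$

We refer to this application of the Hausdorff distance to
partitions as

• LiftH$(\mathcal{P}, \mathcal{P}') = d_H(\Phi(\mathcal{P}), \Phi(\mathcal{P}'))$.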
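
A brute-force version of this Hausdorff computation is short enough to sketch in Python. The function names are illustrative, and the cluster vectors are assumed to be given already as finite-dimensional lists (for instance from an approximate lifting map).

```python
import math


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_hausdorff(A, B):
    """Largest distance from a vector in A to its nearest vector in B."""
    return max(min(l2(a, b) for b in B) for a in A)


def lift_h(A, B):
    """Hausdorff distance between two sets of cluster vectors.

    Brute force: one distance evaluation per pair, in each direction.
    """
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


P1 = [[0.0, 0.0], [4.0, 0.0]]
P2 = [[0.0, 0.0], [4.0, 0.0], [4.0, 3.0]]
d = lift_h(P1, P2)
```

Here the extra cluster vector in `P2` has no close counterpart in `P1`, and it alone determines the distance.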

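We could also use our lifting map again. Since
we can view the collection of points $\Phi(\mathcal{P})$ as a spatial
distribution in $H_\kappa$ (see (4.1)), we can define $\gamma_{\kappa'}$ in this
space as well, with $\kappa'$ again given by any appropriate
kernel (for example, $\kappa'(v, w) = \exp(-\|v - w\|^2_{H_\kappa})$). We
refer to this metric as

• LiftKD$(\mathcal{P}, \mathcal{P}') = \gamma_{\kappa'}(\Phi(\mathcal{P}), \Phi(\mathcal{P}'))$.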



4.1 Computing the distance between partitions

The lifting map $\Phi$ (and its approximation $\tilde\Phi$) creates
efficiency in two ways. First, it is fast to generate a
representation of a cluster C ($O(|C|\rho)$ time), and second,
it is easy to estimate the distance between two clusters
($O(\rho)$ time). This implies that after a linear amount
of processing, all distance computations between
partitions depend only on the number of clusters in each
partition, rather than the size of the input data. Since
the number of clusters is usually orders of magnitude
smaller than the size of the input, this allows us to
use asymptotically inefficient algorithms on $\tilde\Phi(\mathcal{P})$ and
$\tilde\Phi(\mathcal{P}')$ that have small overhead, rather than requiring
more expensive (but asymptotically cheaper in k)
procedures. Assume that we are comparing two partitions
$\mathcal{P}, \mathcal{P}'$ with k and k' clusters respectively. LiftEMD is
computed in general using a min-cost flow formulation
of the problem, which is then solved using the Hungarian
algorithm. This algorithm takes time $O((k + k')^3)$.
While various approximations of $d_T$ exist [30], [18], the
exact method suffices for our setting for the reasons
mentioned above.

It is immediate from the definition of LiftH that 
it can be computed in time 0{k • k') by a brute 
force calculation. A similar bound holds for exact 
computation of LiftKD. While approximations exist 
for both of these distances, they incur overhead that 
makes them inefficient for small k. 

5 Computing Consensus Partitions 

As an application of our proposed distance between par- 
titions, we describe how to construct a spatially-aware 
consensus from a collection of partitions. This method 
reduces the consensus problem to a standard clustering 
problem, allowing us to leverage the extensive body of 
work on standard clustering techniques. Furthermore, 
the representations of clusters as vectors in $\mathbb{R}^\rho$ allow
for very concise representation of the data, making our 
algorithms extremely fast and scalable. 

5.1 A Reduction from Consensus Finding to
Clustering. Our approach exploits the linearity of
cluster representations in $H_\kappa$, and works as follows. Let
$\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_m$ be the input (hard or soft) partitions of
P. Under the lifting map $\Phi$, each partition can be
represented by a set of points $\{\Phi(\mathcal{P}_i)\}$ in $H_\kappa$. Let
$Q = \bigcup_i \Phi(\mathcal{P}_i)$ be the collection of these points.

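Definition 5.1. A (soft) consensus k-partition of
$\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_m$ is a partition $\mathcal{P}_{con}$ of $\bigcup_i \mathcal{P}_i$ into k
clusters $\{C^*_1, \ldots, C^*_k\}$ that minimizes the sum of squared
distances from each $\Phi(C_{i,j}) \in \Phi(\mathcal{P}_i)$ to its associated
$\Phi(C^*_l) \in \Phi(\mathcal{P}_{con})$. Formally, for a set of k vectors
$V = \{v_1, \ldots, v_k\} \subset H_\kappa$ define

$\textsc{LiftSSD}(\{\mathcal{P}_i\}_i, V) = \sum_{C_{i,j} \in \bigcup_i \mathcal{P}_i} \min_{v \in V} \|\Phi(C_{i,j}) - v\|^2_{H_\kappa}$

and then define $\mathcal{P}_{con}$ as the minimum such set

$\Phi(\mathcal{P}_{con}) = \arg\min_{V^* = \{v^*_1, \ldots, v^*_k\} \subset H_\kappa} \textsc{LiftSSD}(\{\mathcal{P}_i\}_i, V^*).$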

How do we interpret $\mathcal{P}_{con}$? Observe that each
element in Q is the lifted representation of some cluster
$C_{i,j}$ in some partition $\mathcal{P}_i$, and therefore corresponds to
some subset of P. Consider now a single cluster in
$\mathcal{P}_{con}$. Since $H_\kappa$ is linear and $\mathcal{P}_{con}$ minimizes distance to
some set of cluster representatives, it must be in their
linear combination. Hence it can be associated with a
weighted subset of elements of $\Phi(P)$, and is hence a soft
partition. It can be made hard by voting each point
$x \in P$ to the representative $C^*_l \in \mathcal{P}_{con}$ for which it has
the largest weight.



Figure 5 shows an example of three partitions,
their normalized RKHS representative vectors, and two
clusters of those vectors.

5.2 Algorithm. We will use the approximate lifting
map $\tilde\Phi$ in our procedure. This allows us to operate
in a $\rho$-dimensional Euclidean space, in which there
are many clustering procedures we can use. For our
experiments, we will use both k-means and hierarchical
agglomerative clustering (HAC). That is, let LiftKm
be the algorithm of running k-means on $\bigcup_i \tilde\Phi(\mathcal{P}_i)$, and
let LiftHAC be the algorithm of running HAC on
$\bigcup_i \tilde\Phi(\mathcal{P}_i)$. For both algorithms, the output is the (soft)
clusters represented by the vectors in $\tilde\Phi(\mathcal{P}_{con})$. Our
results will show that the particular choice of clustering
algorithm (e.g. LiftKm or LiftHAC) is not crucial.
There are multiple methods to choose the right number
of clusters and we can employ one of them to fix k for
our consensus technique. Algorithm 1 summarizes our
consensus procedure.

Cost analysis. Computing Q takes $O(mn\rho) = O(mn \log n)$
time. Let $|Q| = s$. Computing $\mathcal{P}_{con}$
is a single call to any standard Euclidean algorithm
like k-means that takes time $O(sk\rho)$ per iteration, and
computing the final soft partition takes time linear in
$n(\rho + k) + s$. Note that in general we expect that
$k, s \ll n$. In particular, when $s < n$ and m is assumed
constant, then the runtime is $O(n(k + \log n))$.

Figure 5: Three partitions $\mathcal{P}_1 = \{A_1, B_1\}$, $\mathcal{P}_2 = \{A_2, B_2, C_2\}$, $\mathcal{P}_3 = \{A_3, B_3, C_3\}$ of the same dataset, and a 2-d visualization of all of their representations in an RKHS. These vectors are then clustered into $k = 2$ consensus clusters consisting of $\{\Phi(A_1), \Phi(A_2), \Phi(A_3)\}$ and $\{\Phi(B_1), \Phi(B_2), \Phi(C_2), \Phi(B_3), \Phi(C_3)\}$.

Algorithm 1 Consensus finding

Input: (soft) partitions $\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_m$ of $P$, kernel function $\kappa$

Output: Consensus (soft) partition $\mathcal{P}_{con}$

1: Set $Q = \bigcup_i \Phi(\mathcal{P}_i)$.

2: Compute $V^* = \{v_1^*, \ldots, v_k^*\} \subset \mathcal{H}_\kappa$ to minimize liftSSD$(Q, V^*)$ (via k-means for LiftKm or HAC for LiftHAC).

3: Assign each $p \in P$ to the cluster $C_i \in \mathcal{P}_{con}$ associated with the vector $v_i^* \in V^*$

• for which $\langle \Phi(p), v_i^* \rangle$ is maximized, for a hard partition, or

• with weight proportional to $\langle \Phi(p), v_i^* \rangle$, for a soft partition.

4: Output $\mathcal{P}_{con}$.

6 Experimental Evaluation 

In this section we empirically show the effectiveness of our distances between partitions, LiftEMD and LiftKD, and of the consensus clustering algorithms that conceptually follow, LiftKm and LiftHAC.

Data. We created two synthetic datasets in $\mathbb{R}^2$, namely 2D2C, for which data is drawn from 2 Gaussians to produce 2 visibly separate clusters, and 2D3C, for which the points are arbitrarily chosen to produce 3 visibly separate clusters. We also use 5 different datasets from the UCI repository [11] (Wine, Ionosphere, Glass, Iris, Soybean) with various numbers of dimensions and labeled data classes. To show the ability of our consensus procedure and the distance metric to handle large data, we use both the training and test data of the MNIST [22] database of handwritten digits, which has 60,000 and 10,000 examples respectively in $\mathbb{R}^{784}$.

Methodology. We compare our approach with the partition-based measures, namely Rand distance and Jaccard distance, and with the information-theoretic measures, namely Normalized Mutual Information and Normalized Variation of Information [37], as well as with the spatially-aware measures $D_{ADCO}$ [3] and CDistance [7]. We ran k-means, single-linkage, average-linkage, complete-linkage and Ward's method [35] on the datasets to generate input partitions to the consensus clustering methods. We use accuracy [8] and Rand distance [29] to measure the effectiveness of the consensus clustering algorithms by comparing the returned consensus partitions to the original class labeling. We compare our consensus technique against a few hypergraph partitioning based consensus methods: CSPA, HGPA and MCLA [34]. For the MNIST data we also visualize the cluster centroids of each of the input and consensus partitions as a 28x28 grayscale image.

Accuracy studies the one-to-one relationship between clusters and classes; it measures the extent to which each cluster contains data points from the corresponding class. Given a set $P$ of $n$ elements, consider a set of $k$ clusters $\mathcal{P} = \{C_1, \ldots, C_k\}$ and $m \geq k$ classes $\mathcal{L} = \{L_1, \ldots, L_m\}$ denoting the ground truth partition. Accuracy is expressed as

$$A(\mathcal{P}, \mathcal{L}) = \max_{\mu} \frac{1}{n} \sum_{i=1}^{k} |C_i \cap L_{\mu(i)}| \tag{1}$$

where $\mu$ assigns each cluster to a distinct class. The Rand distance is based on the fraction of pairs which are assigned consistently in both partitions, as in the same or in different classes. Let $R_S(\mathcal{P}, \mathcal{L})$ be the number of pairs of points that are in the same cluster in both $\mathcal{P}$ and $\mathcal{L}$, and let $R_D(\mathcal{P}, \mathcal{L})$ be the number of pairs of points that are in different clusters in both $\mathcal{P}$ and $\mathcal{L}$. Now we can define the Rand distance as

$$R(\mathcal{P}, \mathcal{L}) = 1 - \frac{R_S(\mathcal{P}, \mathcal{L}) + R_D(\mathcal{P}, \mathcal{L})}{\binom{n}{2}} \tag{2}$$
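The two evaluation measures can be sketched in a few lines of Python. This is an illustrative brute-force version, assuming the standard definitions of clustering accuracy (best one-to-one matching of clusters to classes) and of the Rand distance (one minus the fraction of consistently assigned pairs); the function names and the toy data are not from the paper, and the search over all injective maps is only feasible for small k and m.

```python
from itertools import combinations, permutations

def accuracy(clusters, classes, n):
    """Max over injective maps mu (cluster i -> class mu[i]) of
    sum_i |C_i intersect L_mu(i)| / n. Clusters and classes are sets of point ids."""
    k, m = len(clusters), len(classes)
    best = 0
    for mu in permutations(range(m), k):  # requires m >= k
        overlap = sum(len(clusters[i] & classes[mu[i]]) for i in range(k))
        best = max(best, overlap)
    return best / n

def rand_distance(labels_a, labels_b):
    """One minus the fraction of point pairs on which the two labelings agree
    (same cluster in both, or different clusters in both)."""
    n = len(labels_a)
    agree = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return 1.0 - agree / (n * (n - 1) / 2)

# Six points; the second labeling moves point 2 into the other group.
clusters = [{0, 1, 2}, {3, 4, 5}]
classes = [{0, 1}, {2, 3, 4, 5}]
print(accuracy(clusters, classes, n=6))
print(rand_distance([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```

For larger k, the maximization over assignments is usually solved with the Hungarian method instead of enumeration.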



Code. We implement $\Phi$ as the Random Projection feature map [20] in C to lift each data point into $\mathbb{R}^p$. We set $p = 200$ for the two synthetic datasets 2D2C and 2D3C and all the UCI datasets. We set $p = 4000$ for the larger datasets, MNIST training and MNIST test. The same lifting is applied to all data points, and thus all clusters.
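For readers who want to experiment, the sketch below shows one common way to build a randomized feature map into $\mathbb{R}^p$: random Fourier features for a Gaussian kernel. Whether this coincides with the map of [20] is an assumption, as are the Gaussian kernel, the bandwidth `sigma`, the names `make_feature_map` and `lift_cluster`, and the choice to represent a cluster by the mean of its lifted points. The paper's implementation is in C; Python is used here only for brevity.

```python
import math
import random

def make_feature_map(dim, p, sigma=1.0, seed=0):
    """Return phi: R^dim -> R^p such that dot(phi(x), phi(y)) approximates
    the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(p)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(p)]
    scale = math.sqrt(2.0 / p)

    def phi(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return phi

def lift_cluster(points, phi):
    """Lift a cluster as the mean of its lifted points (an assumed normalization)."""
    feats = [phi(x) for x in points]
    return [sum(col) / len(feats) for col in zip(*feats)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

phi = make_feature_map(dim=2, p=200)
x, y = [0.0, 0.0], [1.0, 0.0]
print("approximate kernel:", dot(phi(x), phi(y)))
print("exact kernel:      ", math.exp(-0.5))
print("lifted cluster dimension:", len(lift_cluster([x, y], phi)))
```

Because the same `phi` (same random draws) is reused for every point, all clusters from all partitions live in one shared Euclidean space.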

The LiftEMD, LiftKD, and LiftH distances between two partitions $\mathcal{P}$ and $\mathcal{P}'$ are computed by invoking brute-force transportation distance, kernel distance, and Hausdorff distance on $\Phi(\mathcal{P}), \Phi(\mathcal{P}') \subset \mathbb{R}^p$ representing the lifted clusters.
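Of the three distances, the Hausdorff variant is the simplest to write down. The following Python sketch computes the Hausdorff distance between two small sets of lifted cluster vectors by brute force; the function name `hausdorff` and the toy vectors are illustrative assumptions, and the transportation and kernel distances are not shown.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hausdorff(A, B):
    """Hausdorff distance between two finite sets of vectors."""
    def directed(X, Y):
        # Largest distance from a vector in X to its nearest vector in Y.
        return max(min(euclid(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

# Two partitions, each represented by the lifted vectors of its two clusters.
lifted_P = [[0.0, 0.0], [1.0, 0.0]]
lifted_P_prime = [[0.0, 0.0], [3.0, 0.0]]
print(hausdorff(lifted_P, lifted_P_prime))  # 2.0
```

With a handful of clusters per partition, the quadratic number of pairwise distances is negligible compared to the cost of lifting the points.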

To compute the consensus clustering in the lifted space, we apply k-means (for LiftKm) or HAC (for LiftHAC), with the appropriate number of clusters, on the set $Q \subset \mathbb{R}^p$ of all lifted clusters from all partitions. The only parameters required by the procedure are the error term $\varepsilon$ (needed in our choice of $p$) associated with $\Phi$ and any clustering-related parameters.

We used the cluster analysis functions in MATLAB with the default settings to generate the input partitions to the consensus methods, with the given number of classes as the number of clusters. We implemented the algorithm provided by the authors [3] in MATLAB to compute $D_{ADCO}$. To compute CDistance, we used the code provided by the authors [7]. We used the ClusterPack MATLAB toolbox to run the hypergraph partitioning based consensus methods CSPA, HGPA and MCLA.

6.1 Spatial Sensitivity. We start by evaluating the sensitivity of our method. We consider three partitions: the ground truth (or reference) partition (RP), and manually constructed first and second partitions (FP and SP) for the datasets 2D2C (see Figure 1) and 2D3C (see Figure 6). For both datasets the reference partition is by construction spatially closer to the first partition than to the second partition, but the two partitions are equidistant from the reference under any partition-based or information-theoretic measure. Table 1 shows that in each example, our measures correctly conclude that RP is closer to FP than it is to SP.

Figure 6: Different partitions of the 2D3C dataset: (a) Reference Partition (RP), (b) First Partition (FP), (c) Second Partition (SP).

6.2 Efficiency. We compare our distance computation procedure to CDistance. We do not compare against $D_{ADCO}$ and CC because they are not well-founded. Both LiftEMD and CDistance compute $d_T$ between clusters after an initial step of either lifting to a feature space or computing $d_T$ between all pairs of clusters. Thus the proper comparison, and runtime bottleneck, is the initial phase of the algorithms; LiftEMD takes $O(n \log n)$ time whereas CDistance takes $O(n^3)$ time. Table 2 summarizes our results. For instance, on the 2D3C data set with n = 24, our initial phase takes 1.02 milliseconds, and CDistance's initial phase takes 2.03 milliseconds. On the Wine data set with n = 178, our initial phase takes 6.9 milliseconds, while CDistance's initial phase takes 18.8 milliseconds. As the dataset size increases, the advantage of LiftEMD over CDistance becomes even larger. On the MNIST training data with n = 60,000, our initial phase takes a little less than 30 minutes, while CDistance's initial phase takes more than 56 hours.
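The growing gap can be read directly off the reported runtimes. The short Python snippet below copies the initial-phase timings from Table 2 and prints the ratio of CDistance time to LiftEMD time for each dataset, together with the unit conversions behind the "30 minutes" and "56 hours" figures; the dictionary and function names are only for illustration.

```python
# Initial-phase runtimes in seconds, copied from Table 2: (CDistance, LiftEMD).
RUNTIMES = {
    "2D3C": (2.03e-3, 1.02e-3),
    "2D2C": (4.10e-3, 1.95e-3),
    "Wine": (18.80e-3, 6.90e-3),
    "MNIST test data": (1360.20, 303.90),
    "MNIST training data": (202681.0, 1774.20),
}

def speedup(dataset):
    """Ratio of CDistance runtime to LiftEMD runtime for one dataset."""
    cdistance, liftemd = RUNTIMES[dataset]
    return cdistance / liftemd

for name in RUNTIMES:
    print(f"{name}: LiftEMD is {speedup(name):.1f}x faster")

cd_train, lift_train = RUNTIMES["MNIST training data"]
print(f"LiftEMD on MNIST training: {lift_train / 60:.1f} minutes")
print(f"CDistance on MNIST training: {cd_train / 3600:.1f} hours")
```

The ratio is about 2 on the smallest datasets and exceeds 100 on the MNIST training data.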

6.3 Consensus Clustering. We now evaluate our spatially-aware consensus clustering method. We do this first by comparing our consensus partition to the reference solution using the Rand distance (i.e., a partition-based measure) in Table 3. Note that for all data sets, our consensus clustering methods (LiftKm and LiftHAC) return answers that are almost always as close as the best answer returned by any of the hypergraph partitioning based consensus methods CSPA, HGPA, or MCLA. We get very similar results using the accuracy [8] measure in place of Rand.

In Table 4, we then run the same comparisons, but this time using LiftEMD (i.e., a spatially-aware measure). Here, it is interesting to note that in all cases (with the slight exception of Ionosphere) the distance we get is smaller than the distance reported by the hypergraph partitioning based consensus methods, indicating that our method is returning a consensus partition that is spatially closer to the true answer. The two tables also illustrate the flexibility of our framework, since the results using LiftKm and LiftHAC are mostly identical (with one exception being IRIS under LiftEMD).

To summarize, our method provides results that are comparable or better on partition-based measures of consensus, and are superior using spatially-aware measures. The running time of our approach is comparable to the best hypergraph partitioning based approaches, so using our consensus procedure yields the best overall result.

| Technique | 2D2C: d(RP,FP) | 2D2C: d(RP,SP) | 2D3C: d(RP,FP) | 2D3C: d(RP,SP) |
|---|---|---|---|---|
| $D_{ADCO}$ | 1.710 | 1.780 | 1.790 | 1.820 |
| CDistance | 0.240 | 0.350 | 0.092 | 0.407 |
| LiftEMD | 0.430 | 0.512 | 0.256 | 0.310 |
| LiftKD | 0.290 | 0.325 | 0.243 | 0.325 |
| LiftH | 0.410 | 0.490 | 1.227 | 1.291 |

Table 1: Comparing Partitions. Each cell indicates the distance returned under the methods along the rows for the dataset in the column. Spatially, the left column of each data set (2D2C or 2D3C) should be smaller than the right column; this holds for all 5 spatial measures/algorithms tested. In all cases, the two partition-based measures and the two information-theoretic measures yield the same values for d(RP, FP) and d(RP, SP), but are not shown.

| Dataset | Number of points | Number of dimensions | CDistance | LiftEMD |
|---|---|---|---|---|
| 2D3C | 24 | 2 | 2.03 ms | 1.02 ms |
| 2D2C | 45 | 2 | 4.10 ms | 1.95 ms |
| Wine | 178 | 13 | 18.80 ms | 6.90 ms |
| MNIST test data | 10,000 | 784 | 1360.20 s | 303.90 s |
| MNIST training data | 60,000 | 784 | 202681 s | 1774.20 s |

Table 2: Comparison of runtimes: distance between the true partition and the partition generated by k-means.

We also run consensus experiments on the MNIST test data and compare against CSPA and MCLA. We do not compare against HGPA since it runs very slowly on the large MNIST datasets (n = 10,000); it has quadratic complexity in the input size, and in fact, the authors do not recommend it for large data. Figure 7 provides a visualization of the cluster centroids of input partitions generated using k-means, complete linkage HAC and average linkage HAC, and of the consensus partitions generated by CSPA and LiftKm. From the k-means input, only 5 clusters can be easily associated with digits {0, 3, 6, 8, 9}; from the complete linkage HAC input, only 7 clusters can be easily associated with digits {0, 1, 3, 6, 7, 8, 9}; and from the average linkage HAC output, only 6 clusters can be easily associated with digits {0, 1, 2, 3, 7, 9}. The partition that we obtain from running CSPA also lets us identify up to 6 digits {0, 1, 2, 3, 8, 9}. In all the above three partitions, there are cases where two clusters seem to represent the same digit. In contrast, we can identify 9 digits {0, 1, 2, 3, 4, 5, 7, 8, 9} from our LiftKm output, with only the digit 6 being noisy.

Figure 7: 28x28 pixel representation of the cluster centroids for MNIST test input partitions generated using (a) k-means, (b) complete linkage HAC, and (c) average linkage HAC, and the consensus partitions generated by (d) CSPA and (e) LiftKm.

6.4 Error in $\tilde{\Phi}$. There is a tradeoff between the desired error $\varepsilon$ in computing LiftEMD and the number of dimensions $p$ needed for $\tilde{\Phi}$. Figure 8 shows the error as a function of $p$ on the 2D2C dataset (n = 45). From the chart, we can see that p = 100 dimensions suffice to yield a very accurate approximation for the distances. Figure 9 shows the error as a function of $p$ on the MNIST training dataset that has n = 60,000 points. From the chart, we can see that p = 4,000 dimensions suffice to yield a very accurate approximation for the distances.

7 Conclusions 

We provide a spatially-aware metric between partitions based on an RKHS representation of clusters. We also provide a spatially-aware consensus clustering formulation using this representation that reduces to Euclidean clustering. We demonstrate that our algorithms are efficient and are comparable to or better than prior spatially-aware and non-spatially-aware methods.



| Dataset | CSPA | HGPA | MCLA | LiftKm | LiftHAC |
|---|---|---|---|---|---|
| IRIS | 0.088 | 0.270 | 0.115 | 0.114 | 0.125 |
| Glass | 0.277 | 0.305 | 0.428 | 0.425 | 0.430 |
| Ionosphere | 0.422 | 0.502 | 0.410 | 0.420 | 0.410 |
| Soybean | 0.188 | 0.150 | 0.163 | 0.150 | 0.154 |
| Wine | 0.296 | 0.374 | 0.330 | 0.320 | 0.310 |
| MNIST test data | 0.149 | - | 0.163 | 0.091 | 0.110 |

Table 3: Comparison of LiftKm and LiftHAC with hypergraph partitioning based consensus methods under the Rand distance (with respect to ground truth). The numbers are comparable across each row corresponding to a different dataset, and smaller numbers indicate better accuracy. The top two methods for each dataset are highlighted.



| Dataset | CSPA | HGPA | MCLA | LiftKm | LiftHAC |
|---|---|---|---|---|---|
| IRIS | 0.113 | 0.295 | 0.812 | 0.106 | 0.210 |
| Glass | 0.573 | 0.519 | 0.731 | 0.531 | 0.540 |
| Ionosphere | 0.729 | 0.767 | 0.993 | 0.731 | 0.720 |
| Soybean | 0.510 | 0.495 | 0.951 | 0.277 | 0.290 |
| Wine | 0.873 | 0.875 | 0.917 | 0.831 | 0.842 |
| MNIST test data | 0.182 | - | 0.344 | 0.106 | 0.112 |

Table 4: Comparison of LiftKm and LiftHAC with hypergraph partitioning based consensus methods under LiftEMD (with respect to ground truth). The numbers are comparable across each row corresponding to a different dataset, and smaller numbers indicate better accuracy. The top two methods for each dataset are highlighted.




Figure 8: Error in LiftEMD on 2D2C dataset (45 samples) as a function of p.

Figure 9: Error in LiftEMD on MNIST training data (60,000 samples) as a function of p.